Abstract

An active learner is given a class of models, a large set of unlabeled examples, and the ability 
to interactively query labels of a subset of these examples; the goal of the learner is to learn a 
model in the class that fits the data well. 

Previous theoretical work has rigorously characterized label complexity of active learning, 
but most of this work has focused on the PAC or the agnostic PAC model. In this paper, 
we shift our attention to a more general setting - maximum likelihood estimation. Provided 
certain conditions hold on the model class, we provide a two-stage active learning algorithm 
for this problem. The conditions we require are fairly general, and cover the widely popular 
class of Generalized Linear Models, which in turn, include models for binary and multi-class 
classification, regression, and conditional random fields. 

We provide an upper bound on the label requirement of our algorithm, and a lower bound 
that matches it up to lower order terms. Our analysis shows that unlike binary classification in 
the realizable case, just a single extra round of interaction is sufficient to achieve near-optimal 
performance in maximum likelihood estimation. On the empirical side, the recent work in m 
and [1^ (on active linear and logistic regression) shows the promise of this approach. 


1 Introduction 

In active learning, we are given a sample space $\mathcal{X}$, a label space $\mathcal{Y}$, a class of models that map $\mathcal{X}$ to
$\mathcal{Y}$, and a large set $U$ of unlabelled samples. The goal of the learner is to learn a model in the class
with small target error while interactively querying the labels of as few of the unlabelled samples as 
possible. 

Most theoretical work on active learning has focussed on the PAC or the agnostic PAC model, 
where the goal is to learn binary classifiers that belong to a particular hypothesis class Enniniiaii 
and there has been only a handful of exceptions [HI IB US]- In this paper, we shift our attention 
to a more general setting - maximum likelihood estimation (MLE), where $\Pr(Y|X)$ is described by a
model $\theta$ belonging to a model class $\Theta$. We show that when data is generated by a model in this class,
we can do active learning provided the model class $\Theta$ has the following simple property: the Fisher
information matrix for any model $\theta \in \Theta$ at any $(x, y)$ depends only on $x$ and $\theta$. This condition is
satisfied in a number of widely applicable model classes, such as Linear Regression and Generalized
Linear Models (GLMs), which in turn include models for Multiclass Classification and Conditional
Random Fields. Consequently, we can provide active learning algorithms for maximum likelihood
estimation in all these model classes.

The standard solution to active MLE estimation in the statistics literature is to select samples 
for label query by optimizing a class of summary statistics of the asymptotic covariance matrix of 
the estimator [5]. The literature, however, does not provide any guidance towards which summary 
statistic should be used, or any analysis of the solution quality when a finite number of labels or 
samples are available. There has also been some recent work in the machine learning community m 
[T2l[T9] on this problem; but these works focus on simple special cases (such as linear regression [I^[TT| 
or logistic regression [HD, and only m involves a consistency and finite sample analysis. 

In this work, we consider the problem in its full generality, with the goal of minimizing the 
expected log-likelihood error over the unlabelled data. We provide a two-stage active learning 
algorithm for this problem. In the first stage, our algorithm queries the labels of a small number 
of random samples from the data distribution in order to construct a crude estimate $\theta_1$ of the
optimal parameter $\theta^*$. In the second stage, we select a set of samples for label query by optimizing a
summary statistic of the covariance matrix of the estimator at $\theta_1$; however, unlike the experimental
design work, our choice of statistic is directly motivated by our goal of minimizing the expected 
log-likelihood error, which guides us towards the right objective. 

We provide a finite sample analysis of our algorithm when some regularity conditions hold and 
when the negative log likelihood function is convex. Our analysis is still fairly general, and applies 
to Generalized Linear Models, for example. We match our upper bound with a corresponding lower 
bound, which shows that the convergence rate of our algorithm is optimal (except for lower order 
terms); the finite sample convergence rate of any algorithm that uses (perhaps multiple rounds of) 
sample selection and maximum likelihood estimation is either the same or higher than that of our 
algorithm. This implies that unlike what is observed in learning binary classifiers, a single round of 
interaction is sufficient to achieve near-optimal log likelihood error for ML estimation. 

1.1 Related Work 

Previous theoretical work on active learning has focussed on learning a classifier belonging to a 
hypothesis class $\mathcal{H}$ in the PAC model. Both the realizable and non-realizable cases have been
considered. In the realizable case, a line of work mm has looked at a generalization of binary 
search; while their algorithms enjoy low label complexity, this style of algorithms is inconsistent in 
the presence of noise. The two main styles of algorithms for the non-realizable case are disagreement- 
based active learning miiii], and margin or confidence-based active learning mm- While active 
learning in the realizable case has been shown to achieve an exponential improvement in label 
complexity over passive learning [2l[6l[T3], in the agnostic case, the gains are more modest (sometimes 
a constant factor) [181 1^ [7]. Moreover, lower bounds [14] show that the label requirement of any 
agnostic active learning algorithm is always at least $\Omega(\nu^2/\epsilon^2)$, where $\nu$ is the error of the best
hypothesis in the class, and $\epsilon$ is the target error. In contrast, our setting is much more general
than binary classification, and includes regression, multi-class classification and certain kinds of 
conditional random fields that are not covered by previous work. 

[19] provides an active learning algorithm for the linear regression problem under model mismatch.
Their algorithm attempts to learn the location of the mismatch by fitting increasingly refined
partitions of the domain, and then uses this information to reweight the examples. If the partition is
highly refined, then the computational complexity of the resulting algorithm may be exponential 
in the dimension of the data domain. In contrast, our algorithm applies to a more general setting, 
and while we do not address model mismatch, our algorithm has polynomial time complexity. [1] 
provides an active learning algorithm for Generalized Linear Models in an online selective sampling 
setting; however, unlike ours, their input is a stream of unlabelled examples, and at each step, they 
need to decide whether the label of the current example should be queried. 

Our work is also related to the classical statistical work on optimal experiment design, which 
mostly considers maximum likelihood estimation [5]. For uni-variate estimation, they suggest
selecting samples to maximize the Fisher information, which corresponds to minimizing the variance
of the regression coefficient. When $\theta$ is multi-variate, the Fisher information is a matrix; in this
case, there are multiple notions of optimal design which correspond to optimizing different summary
statistics of the Fisher information matrix. For example, D-optimality maximizes the determinant of
the Fisher information, and A-optimality minimizes the trace of its inverse. In contrast with this work,
we directly optimize the expected log-likelihood over the unlabelled data which guides us to the
appropriate objective function; moreover, we provide consistency and finite sample guarantees.

Finally, on the empirical side, m and m derive algorithms similar to ours for logistic and linear 
regression based on projected gradient descent. Notably, these works provide promising empirical 
evidence for this approach to active learning; however, no consistency guarantees or convergence 
rates are provided (the rates presented in these works are not stated in terms of the sample size). 
In contrast, our algorithm applies more generally, and we provide consistency guarantees and con¬ 
vergence rates. Moreover, unlike m, our logistic regression algorithm uses a single extra round of 
interaction, and our results illustrate that a single round is sufficient to achieve a convergence rate 
that is optimal except for lower order terms. 


2 The Model 

We begin with some notation. We are given a pool $U = \{x_1, \ldots, x_n\}$ of $n$ unlabelled examples drawn
from some instance space $\mathcal{X}$, and the ability to interactively query labels belonging to a label space
$\mathcal{Y}$ of $m$ of these examples. In addition, we are given a family of models $\mathcal{M} = \{p(y|x,\theta), \theta \in \Theta\}$
parameterized by $\theta \in \Theta \subseteq \mathbb{R}^d$. We assume that there exists an unknown parameter $\theta^* \in \Theta$ such
that querying the label of an $x_i \in U$ generates a $y_i$ drawn from the distribution $p(y|x_i, \theta^*)$. We also
abuse notation and use $U$ to denote the uniform distribution over the examples in $U$.

We consider the fixed-design (or transductive) setting, where our goal is to minimize the error
on the fixed set of points $U$. For any $x \in \mathcal{X}$, $y \in \mathcal{Y}$ and $\theta \in \Theta$, we define the negative log-likelihood
function $L(y|x, \theta)$ as:

\[ L(y|x,\theta) = -\log p(y|x,\theta). \]

Our goal is to find a $\hat{\theta}$ to minimize $L_U(\hat{\theta})$, where

\[ L_U(\theta) = \mathbb{E}_{X \sim U,\; Y \sim p(Y|X,\theta^*)}\left[ L(Y|X,\theta) \right], \]

by interactively querying labels for a subset of $U$ of size $m$, where we allow label queries with
replacement, i.e., the label of an example may be queried multiple times.
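
To make the objective concrete, the following sketch evaluates $L_U(\theta)$ exactly for a toy one-dimensional logistic model on a three-point pool. The model, the pool, the value of `theta_star`, and the helper names `prob_pos`, `nll` and `pool_risk` are illustrative assumptions and do not come from the paper; only the definition of $L_U$ above is used.

```python
import math

def prob_pos(x, theta):
    """p(y = +1 | x, theta) for a 1-D logistic model (an assumed toy model)."""
    return 1.0 / (1.0 + math.exp(-theta * x))

def nll(y, x, theta):
    """Negative log-likelihood L(y | x, theta) for labels y in {-1, +1}."""
    p = prob_pos(x, theta)
    return -math.log(p if y == 1 else 1.0 - p)

def pool_risk(pool, theta, theta_star):
    """L_U(theta): average over the pool of E_{Y ~ p(. | x, theta_star)} L(Y | x, theta)."""
    total = 0.0
    for x in pool:
        p_star = prob_pos(x, theta_star)
        total += p_star * nll(1, x, theta) + (1.0 - p_star) * nll(-1, x, theta)
    return total / len(pool)

pool = [-1.0, 0.5, 2.0]          # hypothetical pool U
theta_star = 0.8                 # hypothetical true parameter
risk_at_truth = pool_risk(pool, theta_star, theta_star)
risk_elsewhere = [pool_risk(pool, t, theta_star) for t in (-1.0, 0.0, 0.4, 1.5)]
```

Because the expected negative log-likelihood is minimized at the true conditional distribution, `risk_at_truth` is the smallest of these values; the gap `pool_risk(pool, theta, theta_star) - risk_at_truth` is the log-likelihood error that the rest of the paper bounds.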

An additional quantity of interest to us is the Fisher information matrix, or the Hessian of the
negative log-likelihood function $L(y|x,\theta)$, which determines the convergence rate. For our active
learning procedure to work correctly, we require the following condition.

Condition 1. For any $x \in \mathcal{X}$, $y \in \mathcal{Y}$, $\theta \in \Theta$, the Fisher information $\frac{\partial^2 L(y|x,\theta)}{\partial \theta^2}$ is a function of only
$x$ and $\theta$ (and does not depend on $y$).

Condition 1 is satisfied by a number of models of practical interest; examples include linear
regression and generalized linear models. Section 5.1 provides a brief derivation of Condition 1 for
generalized linear models.

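For any $x$, $y$ and $\theta$, we use $I(x, \theta)$ to denote the Hessian $\frac{\partial^2 L(y|x,\theta)}{\partial \theta^2}$; observe that by Condition 1
this is just a function of $x$ and $\theta$. Let $\Gamma$ be any distribution over the unlabelled samples in $U$; for
any $\theta \in \Theta$, we use:

\[ I_\Gamma(\theta) = \mathbb{E}_{X \sim \Gamma}\left[ I(X,\theta) \right]. \]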
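Algorithm 1 ActiveSetSelect
Input: Samples $x_i$, for $i = 1, \ldots, n$
1: Draw $m_1$ samples u.a.r. from $U$, and query their labels to get $S_1$.
2: Use $S_1$ to solve the MLE problem:

\[ \theta_1 = \operatorname{argmin}_{\theta \in \Theta} \sum_{(x_i, y_i) \in S_1} L(y_i|x_i,\theta) \]

3: Solve the following SDP (refer Lemma 3):

\[ a^* = \operatorname{argmin}_{a} \mathrm{Tr}\left( S^{-1} I_U(\theta_1) \right) \quad \text{s.t.} \quad S = \sum_i a_i I(x_i,\theta_1), \quad 0 \le a_i \le 1, \quad \sum_i a_i = m_2 \]

4: Draw $m_2$ examples using probability $\bar{\Gamma} = \alpha \Gamma_1 + (1 - \alpha) U$ where the distribution $\Gamma_1 = \frac{a_i^*}{m_2}$ and
$\alpha = 1 - m_2^{-1/6}$. Query their labels to get $S_2$.
5: Use $S_2$ to solve the MLE problem:

\[ \theta_2 = \operatorname{argmin}_{\theta \in \Theta} \sum_{(x_i, y_i) \in S_2} L(y_i|x_i,\theta) \]

Output: $\theta_2$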


3 Algorithm 

The main idea behind our algorithm is to sample $x_i$ from a well-designed distribution $\Gamma$ over
$U$, query the labels of these samples and perform ML estimation over them. To ensure good
performance, $\Gamma$ should be chosen carefully, and our choice of $\Gamma$ is motivated by Lemma 1. Suppose
the labels $y_i$ are generated according to: $y_i \sim p(y|x_i,\theta^*)$. Lemma 1 states that the expected
log-likelihood error of the ML estimate with respect to $m$ samples from $\Gamma$ in this case is essentially
$\mathrm{Tr}\left( I_\Gamma(\theta^*)^{-1} I_U(\theta^*) \right)/m$.

This suggests selecting $\Gamma$ as the distribution $\Gamma^*$ that minimizes $\mathrm{Tr}\left( I_{\Gamma^*}(\theta^*)^{-1} I_U(\theta^*) \right)$.
Unfortunately, we cannot do this as $\theta^*$ is unknown. We resolve this problem through a two stage algorithm;
in the first stage, we use a small number $m_1$ of samples to construct a coarse estimate $\theta_1$ of $\theta^*$ (Steps
1-2). In the second stage, we calculate a distribution $\Gamma_1$ which minimizes $\mathrm{Tr}\left( I_{\Gamma_1}(\theta_1)^{-1} I_U(\theta_1) \right)$ and
draw samples from (a slight modification of) this distribution for a finer estimation of $\theta^*$ (Steps 3-5).
The distribution $\Gamma_1$ is modified slightly to $\bar{\Gamma}$ (in Step 4) to ensure that $I_{\bar{\Gamma}}(\theta^*)$ is well conditioned
with respect to $I_U(\theta^*)$.

The algorithm is formally presented in Algorithm 1.

Finally, note that Steps 1-2 are necessary because $I_U$ and $I_\Gamma$ are functions of $\theta$. In certain
special cases such as linear regression, $I_U$ and $I_\Gamma$ are independent of $\theta$. In those cases, Steps 1-2 are
unnecessary, and we may skip directly to Step 3.
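
As an illustration of Step 4, the sketch below forms the mixed sampling distribution $\alpha \Gamma_1 + (1-\alpha) U$ from a given weight vector and draws indices with replacement. It is a minimal sketch under stated assumptions: the vector `a_star` stands in for an SDP solution that is not computed here, the mixing weight `alpha` is passed in as a free parameter, and the function names are hypothetical.

```python
import random

def mixed_sampling_weights(a_star, m2, alpha):
    """Per-example probabilities alpha * Gamma_1 + (1 - alpha) * U, with Gamma_1 = a_star / m2."""
    n = len(a_star)
    gamma1 = [a / m2 for a in a_star]
    return [alpha * g + (1.0 - alpha) / n for g in gamma1]

def draw_with_replacement(weights, m2, seed=0):
    """Draw m2 pool indices i.i.d. from the given weights (label queries with replacement)."""
    rng = random.Random(seed)
    return rng.choices(range(len(weights)), weights=weights, k=m2)

a_star = [1.0, 0.5, 0.5, 0.0]   # hypothetical SDP output: entries in [0, 1], summing to m2
m2 = 2
alpha = 0.75                     # hypothetical mixing weight, chosen only for illustration
weights = mixed_sampling_weights(a_star, m2, alpha)
queried = draw_with_replacement(weights, m2)
```

Mixing in the uniform distribution guarantees every pool point a probability of at least $(1-\alpha)/n$, which is what keeps the information matrix of the mixture well conditioned relative to $I_U$ even when `a_star` puts zero weight on some examples.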


4 Performance Guarantees 

The following regularity conditions are essentially a quantified version of the standard Local Asymp¬ 
totic Normality (LAN) conditions for studying maximum likelihood estimation (see [TOl 1^ 1. 

Assumption 1. (Regularity conditions for LAN) 
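1. Smoothness: The first three derivatives of $L(y|x,\theta)$ exist in all interior points of $\Theta \subseteq \mathbb{R}^d$.

2. Compactness: $\Theta$ is compact and $\theta^*$ is an interior point of $\Theta$.

3. Strong Convexity: $I_U(\theta^*) = \frac{1}{n} \sum_{i=1}^{n} I(x_i, \theta^*)$ is positive definite with smallest singular value
$\sigma_{\min} > 0$.

4. Lipschitz continuity: There exists a neighborhood $B$ of $\theta^*$ and a constant $L_3$ such that for
all $x \in U$, $I(x, \theta)$ is $L_3$-Lipschitz in this neighborhood:

\[ \left\| I_U(\theta^*)^{-1/2} \left( I(x,\theta) - I(x,\theta') \right) I_U(\theta^*)^{-1/2} \right\|_2 \le L_3 \|\theta - \theta'\|_{I_U(\theta^*)}, \]

for every $\theta, \theta' \in B$.

5. Concentration at $\theta^*$: For any $x \in U$ and $y$, we have (with probability one),

\[ \|\nabla L(y|x,\theta^*)\|_{I_U(\theta^*)^{-1}} \le L_1, \quad \text{and} \quad \left\| I_U(\theta^*)^{-1/2} I(x,\theta^*) I_U(\theta^*)^{-1/2} \right\|_2 \le L_2. \]

6. Boundedness: $\max_{(x,y)} \sup_{\theta \in \Theta} |L(y|x,\theta)| \le R$.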

In addition to the above, we need one extra condition which is essentially a pointwise self
concordance. This condition is satisfied by a vast class of models, including the generalized linear
models.


Assumption 2. Point-wise self concordance: 


\[ -L_4 \|\theta - \theta^*\|_2 \, I(x,\theta^*) \preceq I(x,\theta) - I(x,\theta^*) \preceq L_4 \|\theta - \theta^*\|_2 \, I(x,\theta^*). \]


Definition 1. [Optimal Sampling Distribution $\Gamma^*$] We define the optimal sampling distribution
$\Gamma^*$ over the points in $U$ as the distribution $\Gamma^* = (\gamma_1^*, \ldots, \gamma_n^*)$ for which $\gamma_i^* \ge 0$, $\sum_i \gamma_i^* = 1$, and
$\mathrm{Tr}\left( I_{\Gamma^*}(\theta^*)^{-1} I_U(\theta^*) \right)$ is as small as possible.

Definition 1 is motivated by Lemma 1 which indicates that under some mild regularity
conditions, a ML estimate calculated on samples drawn from $\Gamma^*$ will provide the best convergence rates
(including the right constant factor) for the expected log-likelihood error.

We now present the main result of our paper. The proof of the following theorem and all the
supporting lemmas will be presented in Appendix A.

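Theorem 1. Suppose the regularity conditions in Assumptions 1 and 2 hold. Let $\beta \ge 10$, and the
number of samples used in step (1) be

\[ m_1 > O\left( \max\left( L_2 \log^2 d,\; L_1^2 \left( L_3^2 + \frac{1}{\sigma_{\min}} \right) \log^2 d,\; \frac{\mathrm{diameter}(\Theta)}{\mathrm{Tr}\left( I_U(\theta^*)^{-1} \right)},\; \frac{\beta^2 L_4^2}{\delta} \mathrm{Tr}\left( I_U(\theta^*)^{-1} \right) \right) \right). \]

Then with probability $\ge 1 - \delta$, the expected log likelihood error of the estimate $\theta_2$ of Algorithm 1 is
bounded as:

\[ \mathbb{E}\left[ L_U(\theta_2) \right] - L_U(\theta^*) \le \left( 1 + \frac{1}{\beta - 1} \right)^4 \left( 1 + \tilde{\epsilon}_{m_2} \right) \mathrm{Tr}\left( I_{\Gamma^*}(\theta^*)^{-1} I_U(\theta^*) \right) \frac{1}{m_2} + \frac{R}{m_2^2}, \tag{1} \]

where $\Gamma^*$ is the optimal sampling distribution in Definition 1 and $\tilde{\epsilon}_{m_2} = O\left( \left( L_1 L_3 + \sqrt{L_2} \right) \frac{\sqrt{\log d m_2}}{m_2^{1/6}} \right)$.
Moreover, for any sampling distribution $\Gamma$ satisfying $I_\Gamma(\theta^*) \succeq c I_U(\theta^*)$ and label constraint of $m_2$,
we have the following lower bound on the expected log likelihood error for ML estimate:

\[ \mathbb{E}\left[ L_U(\hat{\theta}_\Gamma) \right] - L_U(\theta^*) \ge \left( 1 - \epsilon_{m_2} \right) \mathrm{Tr}\left( I_\Gamma(\theta^*)^{-1} I_U(\theta^*) \right) \frac{1}{m_2} - \frac{L_1^2}{c\, m_2^2}, \tag{2} \]

where $\epsilon_{m_2} \stackrel{\mathrm{def}}{=} \frac{\tilde{\epsilon}_{m_2}}{c^2 m_2^{1/3}}$.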

Remark 1. (Restricting to Maximum Likelihood Estimation) Our restriction to maximum likelihood 
estimators is minor, as this is close to minimax optimal (see m)- Minor improvements with certain 
kinds of estimators, such as the James-Stein estimator, are possible. 
4.1 Discussions

Several remarks about Theorem 1 are in order.

The high probability bound in Theorem 1 is with respect to the samples drawn in $S_1$; provided
these samples are representative (which happens with probability $\ge 1 - \delta$), the output $\theta_2$
of Algorithm 1 will satisfy (1). Additionally, Theorem 1 assumes that the labels are sampled with
replacement; in other words, we can query the label of a point $x_i$ multiple times. Removing this
assumption is an avenue for future work.

Second, the highest order term in both (1) and (2) is $\mathrm{Tr}\left( I_{\Gamma^*}(\theta^*)^{-1} I_U(\theta^*) \right)/m_2$. The terms
involving $\tilde{\epsilon}_{m_2}$ and $\epsilon_{m_2}$ are lower order as both $\tilde{\epsilon}_{m_2}$ and $\epsilon_{m_2}$ are $o(1)$. Moreover, if $\beta = \omega(1)$, then the
term involving $\beta$ in (1) is of a lower order as well. Observe that $\beta$ also measures the tradeoff between
$m_1$ and $m_2$, and as long as $\beta = o(\sqrt{m_2})$, $m_1$ is also of a lower order than $m_2$. Thus, provided $\beta$ is
$\omega(1)$ and $o(\sqrt{m_2})$, the convergence rate of our algorithm is optimal except for lower order terms.

Finally, the lower bound (2) applies to distributions $\Gamma$ for which $I_\Gamma(\theta^*) \succeq c I_U(\theta^*)$, where $c$ occurs
in the lower order terms of the bound. This constraint is not very restrictive, and does not affect
the asymptotic rate. Observe that $I_U(\theta^*)$ is full rank. If $I_\Gamma(\theta^*)$ is not full rank, then the expected
log likelihood error of the ML estimate with respect to $\Gamma$ will not be consistent, and thus such a
$\Gamma$ will never achieve the optimal rate. If $I_\Gamma(\theta^*)$ is full rank, then there always exists a $c$ for which
$I_\Gamma(\theta^*) \succeq c I_U(\theta^*)$. Thus (2) essentially states that for distributions $\Gamma$ where $I_\Gamma(\theta^*)$ is close to being
rank-deficient, the asymptotic convergence rate of $O\left( \mathrm{Tr}\left( I_\Gamma(\theta^*)^{-1} I_U(\theta^*) \right)/m_2 \right)$ is achieved at larger
values of $m_2$.

4.2 Proof Outline 

Our main result relies on the following three steps. 

4.2.1 Bounding the Log-likelihood Error 

First, we characterize the log likelihood error (wrt $U$) of the empirical risk minimizer (ERM) estimate
obtained using a sampling distribution $\Gamma$. Concretely, let $\Gamma$ be a distribution on $U$. Let $\hat{\theta}_\Gamma$ be the
ERM estimate using the distribution $\Gamma$:

\[ \hat{\theta}_\Gamma = \operatorname{argmin}_{\theta \in \Theta} \frac{1}{m_2} \sum_{i=1}^{m_2} L(Y_i|X_i,\theta), \tag{3} \]

where $X_i \sim \Gamma$ and $Y_i \sim p(y|X_i,\theta^*)$. The core of our analysis is Lemma 1 which shows a precise
estimate of the log likelihood error $\mathbb{E}\left[ L_U(\hat{\theta}_\Gamma) - L_U(\theta^*) \right]$.


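Lemma 1. Suppose $L$ satisfies the regularity conditions in Assumptions 1 and 2. Let $\Gamma$ be a
distribution on $U$ and $\hat{\theta}_\Gamma$ be the ERM estimate (3) using $m_2$ labeled examples. Suppose further that
$I_\Gamma(\theta^*) \succeq c I_U(\theta^*)$ for some constant $c < 1$. Then, for any $p \ge 2$ and $m_2$ large enough (depending on
$p$), we have:

\[ \left( 1 - \epsilon_{m_2} \right) \frac{\tau^2}{m_2} - \frac{L_1^2}{c\, m_2^{p/2}} \le \mathbb{E}\left[ L_U(\hat{\theta}_\Gamma) - L_U(\theta^*) \right] \le \left( 1 + \epsilon_{m_2} \right) \frac{\tau^2}{m_2} + \frac{R}{m_2^p}, \]

where $\epsilon_{m_2} = O\left( \frac{1}{c^2} \left( L_1 L_3 + \sqrt{L_2} \right) \sqrt{\frac{p \log d m_2}{m_2}} \right)$ and $\tau^2 = \mathrm{Tr}\left( I_\Gamma(\theta^*)^{-1} I_U(\theta^*) \right)$.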
4.2.2 Approximating $\theta^*$

Lemma 1 motivates sampling from the optimal sampling distribution $\Gamma^*$ that minimizes $\mathrm{Tr}\left( I_{\Gamma^*}(\theta^*)^{-1} I_U(\theta^*) \right)$.
However, this quantity depends on $\theta^*$, which we do not know. To resolve this issue, our algorithm
first queries the labels of a small fraction of points ($m_1$) and solves a ML estimation problem to
obtain a coarse estimate $\theta_1$ of $\theta^*$.

How close should $\theta_1$ be to $\theta^*$? Our analysis indicates that it is sufficient for $\theta_1$ to be close enough
that for any $x$, $I(x,\theta_1)$ is a constant factor spectral approximation to $I(x, \theta^*)$; the number of samples
needed to achieve this is analyzed in Lemma 2.

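Lemma 2. Suppose $L$ satisfies the regularity conditions in Assumptions 1 and 2. If the number of
samples used in the first step satisfies

\[ m_1 > O\left( \max\left( L_2 \log^2 d,\; L_1^2 \left( L_3^2 + \frac{1}{\sigma_{\min}} \right) \log^2 d,\; \frac{\mathrm{diameter}(\Theta)}{\mathrm{Tr}\left( I_U(\theta^*)^{-1} \right)},\; \frac{\beta^2 L_4^2}{\delta} \mathrm{Tr}\left( I_U(\theta^*)^{-1} \right) \right) \right), \]

then, we have:

\[ -\frac{1}{\beta} I(x, \theta^*) \preceq I(x,\theta_1) - I(x, \theta^*) \preceq \frac{1}{\beta} I(x, \theta^*) \quad \forall x \in \mathcal{X} \]

with probability greater than $1 - \delta$.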


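4.2.3 Computing $\Gamma_1$

Third, we are left with the task of obtaining a distribution $\Gamma_1$ that minimizes the log likelihood
error. We now pose this optimization problem as an SDP.

From Lemmas 1 and 2, it is clear that we should aim to obtain a sampling distribution $\Gamma = \left( \frac{a_i}{m_2} : i \in [n] \right)$
minimizing $\mathrm{Tr}\left( I_\Gamma(\theta_1)^{-1} I_U(\theta_1) \right)$. Let $I_U(\theta_1) = \sum_j \sigma_j v_j v_j^\top$ be the singular value
decomposition (svd) of $I_U(\theta_1)$. Since $\mathrm{Tr}\left( I_\Gamma(\theta_1)^{-1} I_U(\theta_1) \right) = \sum_{j=1}^{d} \sigma_j v_j^\top I_\Gamma(\theta_1)^{-1} v_j$, this is equivalent to
solving:

\[ \begin{aligned} \min_{a, c} \quad & \sum_{j=1}^{d} \sigma_j c_j \\ \text{s.t.} \quad & S = \sum_i a_i I(x_i, \theta_1), \\ & v_j^\top S^{-1} v_j \le c_j, \\ & a_i \in [0,1], \\ & \sum_i a_i = m_2. \end{aligned} \tag{4} \]

Among the above constraints, the constraint $v_j^\top S^{-1} v_j \le c_j$ seems problematic. However, the Schur
complement formula tells us that:

\[ \begin{bmatrix} c_j & v_j^\top \\ v_j & S \end{bmatrix} \succeq 0 \iff S \succeq 0 \text{ and } v_j^\top S^{-1} v_j \le c_j. \]

In our case, we know that $S \succeq 0$, since it is a sum of positive semi definite matrices. The above
argument proves the following lemma.

Lemma 3. The following two optimization programs are equivalent:

\[ \begin{aligned} \min_{a} \quad & \mathrm{Tr}\left( S^{-1} I_U(\theta_1) \right) \\ \text{s.t.} \quad & S = \sum_i a_i I(x_i, \theta_1), \\ & a_i \in [0,1], \\ & \sum_i a_i = m_2, \end{aligned} \qquad \equiv \qquad \begin{aligned} \min_{a, c} \quad & \sum_{j=1}^{d} \sigma_j c_j \\ \text{s.t.} \quad & S = \sum_i a_i I(x_i, \theta_1), \\ & \begin{bmatrix} c_j & v_j^\top \\ v_j & S \end{bmatrix} \succeq 0, \\ & a_i \in [0,1], \\ & \sum_i a_i = m_2, \end{aligned} \]

where $I_U(\theta_1) = \sum_j \sigma_j v_j v_j^\top$ denotes the svd of $I_U(\theta_1)$.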
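
The Schur complement step can be checked numerically in the smallest nontrivial case. The sketch below assumes a $2 \times 2$ positive definite $S$ and uses Sylvester's criterion (positive leading principal minors) as the definiteness test, so it checks the strict version of the equivalence. The helper names and the numbers are illustrative assumptions, and this is a sanity check rather than an SDP solver.

```python
def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def quad_form_inverse(S, v):
    """v^T S^{-1} v for an invertible 2x2 S, using the adjugate formula for the inverse."""
    d = det2(S)
    inv = [[S[1][1] / d, -S[0][1] / d], [-S[1][0] / d, S[0][0] / d]]
    w = [inv[0][0] * v[0] + inv[0][1] * v[1], inv[1][0] * v[0] + inv[1][1] * v[1]]
    return v[0] * w[0] + v[1] * w[1]

def block_is_positive_definite(S, v, c):
    """Sylvester's criterion on [[S, v], [v^T, c]], a symmetric permutation of [[c, v^T], [v, S]]."""
    M = [[S[0][0], S[0][1], v[0]],
         [S[1][0], S[1][1], v[1]],
         [v[0], v[1], c]]
    return S[0][0] > 0 and det2(S) > 0 and det3(M) > 0

S = [[2.0, 1.0], [1.0, 2.0]]          # hypothetical positive definite S
v = [1.0, 0.0]                         # hypothetical direction v_j
threshold = quad_form_inverse(S, v)    # v^T S^{-1} v
above = block_is_positive_definite(S, v, 1.0)
below = block_is_positive_definite(S, v, 0.5)
```

The block matrix is positive definite exactly when `c` exceeds `threshold`, which is why the nonlinear constraint on $S^{-1}$ can be traded for a linear matrix inequality in $(a, c)$.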
5 Illustrative Examples

We next present some examples that illustrate Theorem 1. We begin by showing that Condition 1
is satisfied by the popular class of Generalized Linear Models.

5.1 Derivations for Generalized Linear Models 

A generalized linear model is specified by three parameters - a linear model, a sufficient statistic,
and a member of the exponential family. Let $\eta$ be a linear model: $\eta = \theta^\top X$. Then, in a
Generalized Linear Model (GLM), $Y$ is drawn from an exponential family distribution with parameter
$\eta$. Specifically, $p(Y = y|\eta) = e^{\eta^\top t(y) - A(\eta)}$, where $t(\cdot)$ is the sufficient statistic and $A(\cdot)$ is the
log-partition function. From properties of the exponential family, the log-likelihood is written as
$\log p(y|\eta) = \eta^\top t(y) - A(\eta)$. If we take $\eta = \theta^\top x$, and take the derivative with respect to $\theta$, we have:
$\frac{\partial \log p(y|\theta,x)}{\partial \theta} = x\, t(y) - x A'(\theta^\top x)$. Taking derivatives again gives us $\frac{\partial^2 \log p(y|\theta,x)}{\partial \theta^2} = -x x^\top A''(\theta^\top x)$,
which is independent of $y$.
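
The derivation can be sanity-checked numerically. The sketch below assumes scalar $\theta$ and $x$, takes $t(y) = y$, drops any base-measure term (which does not depend on $\theta$), and estimates the second derivative of the negative log-likelihood by central finite differences for two standard log-partition functions. The function names, the step size `h`, and the test values are illustrative assumptions.

```python
import math

def log_partition_bernoulli(eta):
    """A(eta) = log(1 + e^eta), the Bernoulli (logistic) log-partition function."""
    return math.log1p(math.exp(eta))

def log_partition_poisson(eta):
    """A(eta) = e^eta, the Poisson log-partition function."""
    return math.exp(eta)

def glm_nll(y, x, theta, log_partition):
    """L(y | x, theta) = A(theta * x) - y * theta * x for scalar theta and x, with t(y) = y."""
    eta = theta * x
    return log_partition(eta) - y * eta

def second_derivative(f, t, h=1e-4):
    """Central finite-difference estimate of f''(t)."""
    return (f(t + h) - 2.0 * f(t) + f(t - h)) / (h * h)

def hessian_in_theta(y, x, theta, log_partition):
    """Numerical d^2 L / d theta^2 at the given label y."""
    return second_derivative(lambda t: glm_nll(y, x, t, log_partition), theta)

x, theta = 1.5, 0.3
poisson_hessians = [hessian_in_theta(y, x, theta, log_partition_poisson) for y in (0, 2, 7)]
bernoulli_hessians = [hessian_in_theta(y, x, theta, log_partition_bernoulli) for y in (0, 1)]
poisson_closed_form = x * x * math.exp(theta * x)   # x^2 * A''(theta * x) with A = exp
```

The label enters the negative log-likelihood only through the term $-y\,\theta x$, which is linear in $\theta$, so the estimated Hessians agree across labels up to finite-difference error and match $x^2 A''(\theta x)$.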

5.2 Specific Examples 

We next present three illustrative examples of problems that our algorithm may be applied to. 

Linear Regression. Our first example is linear regression. In this case, $x \in \mathbb{R}^d$ and $Y \in \mathbb{R}$ are
generated according to the distribution: $Y = \theta^{*\top} X + \eta$, where $\eta$ is a noise variable drawn from $\mathcal{N}(0,1)$.
In this case, the negative log-likelihood function is (up to an additive constant): $L(y|x,\theta) = \frac{1}{2}(y - \theta^\top x)^2$, and the corresponding
Fisher information matrix $I(x,\theta)$ is given as: $I(x,\theta) = x x^\top$. Observe that in this (very special)
case, the Fisher information matrix does not depend on $\theta$; as a result we can eliminate the first two
steps of the algorithm, and proceed directly to step 3. If $\Sigma = \frac{1}{n} \sum_i x_i x_i^\top$ is the covariance matrix
of $U$, then Theorem 1 tells us that we need to query labels from a distribution $\Gamma^*$ with covariance
matrix $\Lambda$ such that $\mathrm{Tr}\left( \Lambda^{-1} \Sigma \right)$ is minimized.

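We illustrate the advantages of active learning through a simple example. Suppose $U$ is the
unlabelled distribution:

\[ x_i = \begin{cases} e_1 & \text{w.p. } 1 - \frac{d-1}{d^2}, \\ e_j & \text{w.p. } \frac{1}{d^2} \text{ for } j \in \{2, \ldots, d\}, \end{cases} \]

where $e_j$ is the standard unit vector in the $j$-th direction. The covariance matrix $\Sigma$ of $U$ is a diagonal
matrix with $\Sigma_{11} = 1 - \frac{d-1}{d^2}$ and $\Sigma_{jj} = \frac{1}{d^2}$ for $j \ge 2$. For passive learning over $U$, we query labels
of examples drawn from $U$ which gives us a convergence rate of $\frac{\mathrm{Tr}(\Sigma^{-1}\Sigma)}{m_2} = \frac{d}{m_2}$. On the other hand,
active learning chooses to sample examples from the distribution $\Gamma^*$ such that

\[ x_i = \begin{cases} e_1 & \text{w.p. } \sim 1 - \frac{d-1}{2d}, \\ e_j & \text{w.p. } \sim \frac{1}{2d} \text{ for } j \in \{2, \ldots, d\}, \end{cases} \]

where $\sim$ indicates that the probabilities hold up to $O\left( \frac{1}{d^2} \right)$. This has a diagonal covariance
matrix $\Lambda$ such that $\Lambda_{11} \sim 1 - \frac{d-1}{2d}$ and $\Lambda_{jj} \sim \frac{1}{2d}$ for $j \ge 2$, and convergence rate of
$\frac{\mathrm{Tr}(\Lambda^{-1}\Sigma)}{m_2} \sim \frac{1}{m_2} \left( 2 \cdot 1 + (d-1) \cdot 2d \cdot \frac{1}{d^2} \right) \le \frac{4}{m_2}$, which does not grow with $d$!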
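
The gap between passive and active sampling in this example can be reproduced with a few lines of arithmetic. The sketch below is specific to the diagonal case: when every pool point is a standard basis vector, $\mathrm{Tr}(\Lambda^{-1}\Sigma) = \sum_j \Sigma_{jj}/\Lambda_{jj}$, and by the Cauchy-Schwarz inequality this is minimized over probability vectors by taking $\Lambda_{jj}$ proportional to $\sqrt{\Sigma_{jj}}$. That closed form stands in for the SDP of Algorithm 1 here and is an assumption of this sketch, as are the function names.

```python
import math

def pool_second_moments(d):
    """Diagonal of Sigma for the pool: e_1 w.p. 1 - (d-1)/d^2 and e_j w.p. 1/d^2 for j = 2..d."""
    rare = 1.0 / (d * d)
    return [1.0 - (d - 1) * rare] + [rare] * (d - 1)

def best_diagonal_design(sigma):
    """Minimizer of sum_j sigma_j / lam_j over the simplex: lam_j proportional to sqrt(sigma_j)."""
    roots = [math.sqrt(s) for s in sigma]
    total = sum(roots)
    return [r / total for r in roots]

def trace_ratio(lam, sigma):
    """Tr(Lambda^{-1} Sigma) for diagonal Lambda and Sigma."""
    return sum(s / l for s, l in zip(sigma, lam))

rates = {}
for d in (5, 50, 500):
    sigma = pool_second_moments(d)
    passive = trace_ratio(sigma, sigma)                       # sample from U itself
    active = trace_ratio(best_diagonal_design(sigma), sigma)  # sample from the optimized design
    rates[d] = (passive, active)
```

For each `d`, the passive value equals `d` while the active value stays below 4, matching the $d/m_2$ versus $4/m_2$ comparison in the text once both are divided by $m_2$.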

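Logistic Regression. Our second example is logistic regression for binary classification. In this
case, $x \in \mathbb{R}^d$, $Y \in \{-1, 1\}$ and the negative log-likelihood function is: $L(y|x,\theta) = \log\left( 1 + e^{-y \theta^\top x} \right)$,
and the corresponding Fisher information $I(x,\theta)$ is given as: $I(x,\theta) = \frac{e^{\theta^\top x}}{\left( 1 + e^{\theta^\top x} \right)^2} \cdot x x^\top$.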

For illustration, suppose $\|\theta^*\|_2$ and $\|x\|_2$ are bounded by a constant and the covariance matrix
$\Sigma$ is sandwiched between two multiples of identity in the PSD ordering, i.e., $\frac{c}{d} I \preceq \Sigma \preceq \frac{C}{d} I$ for
some constants $c$ and $C$. Then the regularity Assumptions 1 and 2 are satisfied for constant values of
$L_1$, $L_2$, $L_3$ and $L_4$. In this case, Theorem 1 states that choosing $m_1$ to be $\omega\left( \mathrm{Tr}\left( I_U(\theta^*)^{-1} \right) \right) = \omega(d)$
gives us the optimal convergence rate of $(1 + o(1)) \frac{\mathrm{Tr}\left( I_{\Gamma^*}(\theta^*)^{-1} I_U(\theta^*) \right)}{m_2}$.


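Multinomial Logistic Regression. Our third example is multinomial logistic regression for
multi-class classification. In this case, $Y \in \{1, \ldots, K\}$, $x \in \mathbb{R}^d$, and the parameter matrix $\theta \in \mathbb{R}^{(K-1) \times d}$.
The negative log-likelihood function is written as: $L(y|x,\theta) = -\theta_y^\top x + \log\left( 1 + \sum_{k=1}^{K-1} e^{\theta_k^\top x} \right)$
if $y \ne K$, and $L(y = K|x, \theta) = \log\left( 1 + \sum_{k=1}^{K-1} e^{\theta_k^\top x} \right)$ otherwise. The corresponding Fisher information
matrix is a $(K-1)d \times (K-1)d$ matrix, which is obtained as follows. Let $F$ be the $(K-1) \times (K-1)$
matrix with:

\[ F_{ii} = \frac{e^{\theta_i^\top x} \left( 1 + \sum_{k \ne i} e^{\theta_k^\top x} \right)}{\left( 1 + \sum_{k} e^{\theta_k^\top x} \right)^2}, \qquad F_{ij} = -\frac{e^{\theta_i^\top x + \theta_j^\top x}}{\left( 1 + \sum_{k} e^{\theta_k^\top x} \right)^2} \text{ for } i \ne j. \]

Then, $I(x,\theta) = F \otimes x x^\top$.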

Similar to the example in the logistic regression case, suppose || 6 *y ||2 and ||a ;||2 are bounded 
by a constant and the covariance matrix E satisfies 2^ — ^ some constants c and C. 

Since F* = diag(p*) — p*p*^, where p* = P{y = i\x,9*), the boundedness of || 6 *j ||2 and ||a :||2 
implies that cl < F* F Cl for some constants c and C (depending on K). This means that 
^ I{x,9*) :< and so the regularity assumptions [T] and [D are satisfied with Li,L 2 ,L 3 and 
L 4 being constants. Theorem [T] again tells us that using a;(d) samples in the first step gives us the 
optimal convergence rate of maximum likelihood error. 


6 Conclusion 

In this paper, we provide an active learning algorithm for maximum likelihood estimation which 
provably achieves the optimal convergence rate (up to lower order terms) and uses only two rounds
of interaction. Our algorithm applies in a very general setting, which includes Generalized Linear 
Models. 

There are several avenues of future work. Our algorithm involves solving an SDP which is 
computationally expensive; an open question is whether there is a more efficient, perhaps greedy, 
algorithm that achieves the same rate. A second open question is whether it is possible to remove 
the with-replacement sampling assumption. A final question is what happens if $I_U(\theta^*)$ has a high
condition number. In this case, our algorithm will require a large number of samples in the first 
stage; an open question is whether we can use a more sophisticated procedure in the first stage to 
reduce the label requirement. 
A Proofs


In order to prove Lemma 1, we use the following result which is a modification of [10]. In particular,
the following lemma is a generalization of Theorem 5.1 from [10], and its proof (omitted here) follows
from generalizing the proof of that theorem.

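Lemma 4. Suppose $\psi_1, \ldots, \psi_n : \mathbb{R}^d \to \mathbb{R}$ are random functions drawn iid from a distribution. Let
$P = \mathbb{E}[\psi_i]$ and $Q : \mathbb{R}^d \to \mathbb{R}$ be another function. Let

\[ \hat{\theta} = \operatorname{argmin}_{\theta \in S} \sum_i \psi_i(\theta), \quad \text{and} \quad \theta^* = \operatorname{argmin}_{\theta \in S} P(\theta). \]

Assume:

1. (Convexity of $\psi$): Assume that $\psi$ is convex (with probability one),

2. (Smoothness of $\psi$): Assume that $\psi$ is smooth in the following sense: the first, second and third
derivatives exist at all interior points of $S$ (with probability one),

3. (Regularity conditions): Suppose

(a) $S$ is compact,

(b) $\theta^*$ is an interior point of $S$,

(c) $\nabla^2 P(\theta^*)$ is positive definite (and hence invertible),

(d) $\nabla Q(\theta^*) = 0$,

(e) There exists a neighborhood $B$ of $\theta^*$ and a constant $L_3$ such that (with probability one),
$\nabla^2 \psi(\theta)$ and $\nabla^2 Q(\theta)$ are $L_3$-Lipschitz, namely

\[ \left\| \left( \nabla^2 P(\theta^*) \right)^{-1/2} \left( \nabla^2 \psi(\theta) - \nabla^2 \psi(\theta') \right) \left( \nabla^2 P(\theta^*) \right)^{-1/2} \right\|_2 \le L_3 \|\theta - \theta'\|_{\nabla^2 P(\theta^*)}, \text{ and} \]

\[ \left\| \left( \nabla^2 Q(\theta^*) \right)^{-1/2} \left( \nabla^2 Q(\theta) - \nabla^2 Q(\theta') \right) \left( \nabla^2 Q(\theta^*) \right)^{-1/2} \right\|_2 \le L_3 \|\theta - \theta'\|_{\nabla^2 P(\theta^*)}, \]

for $\theta, \theta' \in B$,

4. (Concentration at $\theta^*$) Suppose

\[ \|\nabla \psi(\theta^*)\|_{\nabla^2 P(\theta^*)^{-1}} \le L_1, \quad \text{and} \quad \left\| \left( \nabla^2 P(\theta^*) \right)^{-1/2} \nabla^2 \psi(\theta^*) \left( \nabla^2 P(\theta^*) \right)^{-1/2} \right\|_2 \le L_2 \]

hold with probability one.

Choose $p \ge 2$ and define

\[ \epsilon_n \stackrel{\mathrm{def}}{=} c \left( L_1 L_3 + \sqrt{L_2} \right) \sqrt{\frac{p \log dn}{n}}, \]

where $c$ is an appropriately chosen constant. Let $c'$ be another appropriately chosen constant. If $n$
is large enough so that $\sqrt{\frac{p \log dn}{n}} \le c' \min\left\{ \frac{1}{\sqrt{L_2}}, \frac{1}{L_1 L_3}, \frac{\mathrm{diameter}(B)}{L_1} \right\}$, then:

\[ \left( 1 - \epsilon_n \right) \frac{\tau^2}{n} - \frac{L_1^2}{n^{p/2}} \le \mathbb{E}\left[ Q(\hat{\theta}) - Q(\theta^*) \right] \le \left( 1 + \epsilon_n \right) \frac{\tau^2}{n} + \frac{\max_{\theta \in S} Q(\theta) - Q(\theta^*)}{n^p}, \]

where

\[ \tau^2 \stackrel{\mathrm{def}}{=} \mathrm{Tr}\left( \mathbb{E}\left[ \nabla \psi(\theta^*) \nabla \psi(\theta^*)^\top \right] \nabla^2 P(\theta^*)^{-1} \nabla^2 Q(\theta^*) \nabla^2 P(\theta^*)^{-1} \right). \]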
The following lemma is a fundamental result relating the variance of the gradient of the log
likelihood to Fisher information matrix for a large class of probability distributions m- 

Lemma 5. Suppose $L$ satisfies the regularity conditions in Assumptions 1 and 2. Then, for any
example $x$, we have:

\[ \mathbb{E}_{p(y|x,\theta^*)}\left[ \nabla L(Y|x,\theta^*) \nabla L(Y|x,\theta^*)^\top \right] = I(x, \theta^*). \]


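We now prove Lemma 1.

(Proof of Lemma 1). We first define

\[ \psi_i(\theta) \stackrel{\mathrm{def}}{=} L(Y_i|X_i,\theta), \]

where $X_i \sim \Gamma$ and $Y_i \sim p(Y|X_i,\theta^*)$ for $i = 1, \ldots, m_2$ and $Q(\theta) = L_U(\theta)$. Using the notation of
Lemma 4 this means that

\[ \nabla^2 P(\theta^*) = I_\Gamma(\theta^*) \quad \text{and} \quad \nabla^2 Q(\theta^*) = I_U(\theta^*). \]

Using the regularity conditions from Section 4 and the hypothesis that $I_\Gamma(\theta^*) \succeq c I_U(\theta^*)$, we see
that this satisfies the hypothesis of Lemma 4 with constants

\[ (\tilde{L}_1, \tilde{L}_2, \tilde{L}_3) = \left( \frac{L_1}{\sqrt{c}}, \frac{L_2}{c}, \frac{L_3}{c^{3/2}} \right). \]

We now apply Lemma 4 to conclude that for large enough $m_2$, we have:

\[ \left( 1 - \epsilon_{m_2} \right) \frac{\tau^2}{m_2} - \frac{L_1^2}{c\, m_2^{p/2}} \le \mathbb{E}\left[ L_U(\hat{\theta}_\Gamma) - L_U(\theta^*) \right] \le \left( 1 + \epsilon_{m_2} \right) \frac{\tau^2}{m_2} + \frac{R}{m_2^p}, \]

where $\epsilon_{m_2} = O\left( \frac{1}{c^2} \left( L_1 L_3 + \sqrt{L_2} \right) \sqrt{\frac{p \log d m_2}{m_2}} \right)$, and

\[ \tau^2 \stackrel{\mathrm{def}}{=} \mathrm{Tr}\left( \mathbb{E}\left[ \nabla \psi(\theta^*) \nabla \psi(\theta^*)^\top \right] I_\Gamma(\theta^*)^{-1} I_U(\theta^*) I_\Gamma(\theta^*)^{-1} \right) = \mathrm{Tr}\left( I_\Gamma(\theta^*)^{-1} I_U(\theta^*) \right), \]

using Lemma 5 in the last step. □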

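We now prove Lemma 2.

(Proof of Lemma 2). Define

\[ \psi_i(\theta) \stackrel{\mathrm{def}}{=} L(Y_i|X_i,\theta), \]

where $X_i \sim U$ and $Y_i \sim p(Y|X_i, \theta^*)$ for $i = 1, \ldots, m_1$ and $Q(\theta) \stackrel{\mathrm{def}}{=} \|\theta - \theta^*\|_2^2$. Using the regularity
conditions from Section 4 we see that this satisfies the hypothesis of Lemma 4 with constants

\[ (\tilde{L}_1, \tilde{L}_2, \tilde{L}_3) = \left( L_1, L_2, \max\left( L_3, \frac{1}{\sqrt{\sigma_{\min}}} \right) \right). \]

We now apply Lemma 4 to conclude that

\[ \mathbb{E}\left[ \|\theta_1 - \theta^*\|_2^2 \right] \le \left( 1 + \epsilon_{m_1} \right) \frac{\tau^2}{m_1} + \frac{\mathrm{diameter}(\Theta)}{m_1^p}, \]

where $\epsilon_{m_1} = O\left( \left( L_1 \max\left( L_3, \frac{1}{\sqrt{\sigma_{\min}}} \right) + \sqrt{L_2} \right) \sqrt{\frac{p \log d m_1}{m_1}} \right)$, and

\[ \tau^2 = \mathrm{Tr}\left( I_U(\theta^*)^{-1} \mathbb{E}\left[ \nabla L(Y|X,\theta^*) \nabla L(Y|X,\theta^*)^\top \right] I_U(\theta^*)^{-1} \right) = \mathrm{Tr}\left( I_U(\theta^*)^{-1} \right), \]

using Lemma 5 in the last step. By the choice of $m_1$, we have that

\[ \mathbb{E}\left[ \|\theta_1 - \theta^*\|_2^2 \right] \le \frac{2 \tau^2}{m_1}. \]

Markov's inequality then tells us that with probability at least $1 - \delta$, we have:

\[ \|\theta_1 - \theta^*\|_2^2 \le \frac{2 \tau^2}{\delta m_1} \le \frac{1}{\beta^2 L_4^2}. \]

Using Assumption 2 on point-wise self concordance of $I(x,\theta)$ now finishes the proof. □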

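(Proof of Theorem 1). The proof is a careful combination of Lemmas 1, 2 and 3.

Lower Bound: For any $\Gamma$ that satisfies $I_\Gamma(\theta^*) \succeq c I_U(\theta^*)$, we can apply Lemma 1 to write:

\[ \mathbb{E}\left[ L_U(\hat{\theta}_\Gamma) - L_U(\theta^*) \right] \ge \left( 1 - \epsilon_{m_2} \right) \frac{\mathrm{Tr}\left( I_\Gamma(\theta^*)^{-1} I_U(\theta^*) \right)}{m_2} - \frac{L_1^2}{c\, m_2^2}. \]

The lower bound follows.

Upper Bound: We begin by showing that if Assumptions 1 and 2 are satisfied, then, from
Lemma 2 we have that with probability $\ge 1 - \delta$, it holds that:

\[ \frac{\beta - 1}{\beta} I(x, \theta^*) \preceq I(x, \theta_1) \preceq \frac{\beta + 1}{\beta} I(x, \theta^*) \quad \forall x \in U. \]

This means that the following hold for distributions $\Gamma_1$, $\Gamma^*$ and $U$:

\[ \frac{\beta - 1}{\beta} I_{\Gamma_1}(\theta^*) \preceq I_{\Gamma_1}(\theta_1) \preceq \frac{\beta + 1}{\beta} I_{\Gamma_1}(\theta^*), \tag{5} \]

\[ \frac{\beta - 1}{\beta} I_{\Gamma^*}(\theta^*) \preceq I_{\Gamma^*}(\theta_1) \preceq \frac{\beta + 1}{\beta} I_{\Gamma^*}(\theta^*), \text{ and} \tag{6} \]

\[ \frac{\beta - 1}{\beta} I_U(\theta^*) \preceq I_U(\theta_1) \preceq \frac{\beta + 1}{\beta} I_U(\theta^*). \tag{7} \]

Since $\bar{\Gamma} = \alpha \Gamma_1 + (1 - \alpha) U$, we have that $I_{\bar{\Gamma}}(\theta^*) \succeq \alpha I_{\Gamma_1}(\theta^*)$ which further implies that
$I_{\bar{\Gamma}}(\theta^*)^{-1} \preceq \frac{1}{\alpha} I_{\Gamma_1}(\theta^*)^{-1}$. Similarly, since $I_{\bar{\Gamma}}(\theta^*) \succeq (1 - \alpha) I_U(\theta^*)$, we can apply Lemma 1 on $\bar{\Gamma}$ to get:

\[ \mathbb{E}\left[ L_U(\theta_2) - L_U(\theta^*) \right] \le \left( 1 + \tilde{\epsilon}_{m_2} \right) \frac{\mathrm{Tr}\left( I_{\bar{\Gamma}}(\theta^*)^{-1} I_U(\theta^*) \right)}{m_2} + \frac{R}{m_2^2} \le \left( 1 + \tilde{\epsilon}_{m_2} \right) \frac{1}{\alpha} \frac{\mathrm{Tr}\left( I_{\Gamma_1}(\theta^*)^{-1} I_U(\theta^*) \right)}{m_2} + \frac{R}{m_2^2}, \]

where $\tilde{\epsilon}_{m_2} = O\left( \frac{1}{(1 - \alpha)^2} \left( L_1 L_3 + \sqrt{L_2} \right) \sqrt{\frac{\log d m_2}{m_2}} \right) = O\left( \left( L_1 L_3 + \sqrt{L_2} \right) \frac{\sqrt{\log d m_2}}{m_2^{1/6}} \right)$.

From (5) and (7), the right hand side is at most:

\[ \left( 1 + \tilde{\epsilon}_{m_2} \right) \frac{1}{\alpha} \left( \frac{\beta + 1}{\beta - 1} \right) \frac{\mathrm{Tr}\left( I_{\Gamma_1}(\theta_1)^{-1} I_U(\theta_1) \right)}{m_2} + \frac{R}{m_2^2}. \]

By definition of $\Gamma_1$, this is at most:

\[ \left( 1 + \tilde{\epsilon}_{m_2} \right) \frac{1}{\alpha} \left( \frac{\beta + 1}{\beta - 1} \right) \frac{\mathrm{Tr}\left( I_{\Gamma^*}(\theta_1)^{-1} I_U(\theta_1) \right)}{m_2} + \frac{R}{m_2^2}. \]

Finally, applying (6) and (7), we get that this is at most:

\[ \left( 1 + \tilde{\epsilon}_{m_2} \right) \frac{1}{\alpha} \left( \frac{\beta + 1}{\beta - 1} \right)^2 \frac{\mathrm{Tr}\left( I_{\Gamma^*}(\theta^*)^{-1} I_U(\theta^*) \right)}{m_2} + \frac{R}{m_2^2}. \]

The upper bound follows. □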






